## Computer Architecture ELE 475 / COS 475 Slide Deck 2: Microcode and Pipelining Review David Wentzlaff

Department of Electrical Engineering **Princeton University** 





## Agenda

- Microcoded Microarchitectures
- Pipeline Review
  - Pipelining Basics
  - Structural Hazards
  - Data Hazards
  - Control Hazards

## Agenda

- Microcoded Microarchitectures
- Pipeline Review
  - Pipelining Basics
  - Structural Hazards
  - Data Hazards
  - Control Hazards

# What Happens When the Processor is Too Large?

# What Happens When the Processor is Too Large?

Time Multiplex Resources!

## Microcontrol Unit Maurice Wilkes, 1954



### Microcoded Microarchitecture



## A Bus-based Datapath for RISC



Microinstruction: register to register transfer (17 control signals)

## Agenda

- Microcoded Microarchitectures
- Pipeline Review
  - Pipelining Basics
  - Structural Hazards
  - Data Hazards
  - Control Hazards

## An Ideal Pipeline



- All objects go through the same stages
- No sharing of resources between any two stages
- Propagation delay through all pipeline stages is equal
- Scheduling of a transaction entering the pipeline is not affected by the transactions in other stages

## An Ideal Pipeline



- All objects go through the same stages
- No sharing of resources between any two stages
- Propagation delay through all pipeline stages is equal
- Scheduling of a transaction entering the pipeline is not affected by the transactions in other stages
- These conditions generally hold for industry assembly lines, but instructions depend on each other causing various hazards

## Unpipelined Datapath for MIPS



## Simplified Unpipelined Datapath



## Pipelined Datapath



Clock period can be reduced by dividing the execution of an instruction into multiple cycles

$$t_C > max \{t_{IM}, t_{RF}, t_{ALU}, t_{DM}, t_{RW}\} (= t_{DM} probably)$$

## **Pipelined Control**



Clock period can be reduced by dividing the execution of an instruction into multiple cycles

$$t_C > max \{t_{IM}, t_{RF}, t_{ALU}, t_{DM}, t_{RW}\} (= t_{DM} probably)$$

**Pipelined Control** 



Clock period can be reduced by dividing the execution of an instruction into multiple cycles

$$t_C > max \{t_{IM}, t_{RF}, t_{ALU}, t_{DM}, t_{RW}\} (= t_{DM} probably)$$





Clock period can be reduced by dividing the execution of an instruction into multiple cycles

$$t_C > max \{t_{IM}, t_{RF}, t_{ALU}, t_{DM}, t_{RW}\} (= t_{DM} probably)$$

However, CPI will increase unless instructions are pipelined



Clock period can be reduced by dividing the execution of an instruction into multiple cycles

$$t_C > max \{t_{IM}, t_{RF}, t_{ALU}, t_{DM}, t_{RW}\} (= t_{DM} probably)$$

However, CPI will increase unless instructions are pipelined

#### "Iron Law" of Processor Performance

- Instructions per program depends on source code, compiler technology, and ISA
- -Cycles per instructions (CPI) depends upon the ISA and the microarchitecture
- Time per cycle depends upon the microarchitecture and the base technology

| Microarchitecture        | CPI | cycle time |
|--------------------------|-----|------------|
| Microcoded               | >1  | short      |
| Single-cycle unpipelined | 1   | long       |
| Pipelined                | 1   | short      |

#### "Iron Law" of Processor Performance

- Instructions per program depends on source code, compiler technology, and ISA
- -Cycles per instructions (CPI) depends upon the ISA and the microarchitecture
- Time per cycle depends upon the microarchitecture and the base technology

| Microarchitecture                | CPI | cycle time |
|----------------------------------|-----|------------|
| Microcoded                       | >1  | short      |
| Single-cycle unpipelined         | 1   | long       |
| Pipelined                        | 1   | short      |
| Multi-cycle, unpipelined control |     |            |

#### "Iron Law" of Processor Performance

- Instructions per program depends on source code, compiler technology, and ISA
- -Cycles per instructions (CPI) depends upon the ISA and the microarchitecture
- Time per cycle depends upon the microarchitecture and the base technology

| Microarchitecture                | CPI | cycle time |
|----------------------------------|-----|------------|
| Microcoded                       | >1  | short      |
| Single-cycle unpipelined         | 1   | long       |
| Pipelined                        | 1   | short      |
| Multi-cycle, unpipelined control | >1  | short      |

## **CPI Examples**



#### **Unpipelined machine**



3 instructions, 3 cycles, CPI=1

#### Pipelined machine



3 instructions, 3 cycles, CPI=1

## **Technology Assumptions**

- A small amount of very fast memory (caches) backed up by a large, slower memory
- Fast ALU (at least for integers)
- Multiported Register files (slower!)

Thus, the following timing assumption is reasonable

$$t_{\text{IM}} \approx t_{\text{RF}} \approx t_{\text{ALU}} \approx t_{\text{DM}} \approx t_{\text{RW}}$$

A 5-stage pipeline will be the focus of our detailed design

 some commercial designs have over 30 pipeline stages to do an integer add!



We need some way to show multiple simultaneous transactions in both space and time

Pipeline Diagrams: Transactions vs. Time



Pipeline Diagrams: Space vs. Time



# Instructions Interact With Each Other in Pipeline

- **Structural Hazard:** An instruction in the pipeline needs a resource being used by another instruction in the pipeline.
- Data Hazard: An instruction depends on a data value produced by an earlier instruction
- Control Hazard: Whether or not an instruction should be executed depends on a control decision made by an earlier instruction.

## Agenda

- Microcoded Microarchitectures
- Pipeline Review
  - Pipelining Basics
  - Structural Hazards
  - Data Hazards
  - Control Hazards

#### Overview of Structural Hazards

- Structural hazards occur when two instructions need the same hardware resource at the same time
- Approaches to resolving structural hazards
  - Schedule: Programmer explicitly avoids scheduling instructions that would create structural hazards
  - Stall: Hardware includes control logic that stalls until earlier instruction is no longer using contended resource
  - Duplicate: Add more hardware to design so that each instruction can access independent resources at the same time
- Simple 5-stage MIPS pipeline has no structural hazards specifically because ISA was designed that way

## Example Structural Hazard:



## Example Structural Hazard:



### Example Structural Hazard:



## Agenda

- Microcoded Microarchitectures
- Pipeline Review
  - Pipelining Basics
  - Structural Hazards
  - Data Hazards
  - Control Hazards

### Overview of Data Hazards

- Data hazards occur when one instruction depends on a data value produced by a preceding instruction still in the pipeline
- Approaches to resolving data hazards
  - Schedule: Programmer explicitly avoids scheduling instructions that would create data hazards
  - Stall: Hardware includes control logic that freezes earlier stages until preceding instruction has finished producing data value
  - Bypass: Hardware datapath allows values to be sent to an earlier stage before preceding instruction has left the pipeline
  - Speculate: Guess that there is not a problem, if incorrect kill speculative instruction and restart

## Example Data Hazard



 $r1 \leftarrow r0 + 10$  (ADDI R1, R0, #10) r4 ← r1 + 17 (ADDI R4, R1, #17) r1 is stale. Oops!

. . .

### Feedback to Resolve Hazards



- Later stages provide dependence information to earlier stages which can stall (or kill) instructions
- Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interaction from instructions in stages 1 to i (otherwise deadlock)

# Resolving Data Hazards with Stalls

(Interlocks)



. . .

# Stalled Stages and Pipeline Bubbles

time t5 t6 t7 .... t0 t2 t3 t4 t1  $I_2$ ΙF  $I_2$ ID Resource EX nop nop  $I_2$ nop Usage MA nop nop nop  $I_2$ **WB** nop nop nop  $I_2$ 

nop ⇒ pipeline bubble

# Stall Control Logic



Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions.

#### Stall Control Logic (ignoring jumps &branches)



Should we always stall if the rs field matches some rd?

not every instruction writes a register ⇒ we

not every instruction reads a register ⇒ re

# Source & Destination Registers

| F       | R-type:                               | op         | rs     | rt     | rd      |          | unc |          |
|---------|---------------------------------------|------------|--------|--------|---------|----------|-----|----------|
| 1       | -type:                                | ор         | rs     | rt     | imn     | nediate1 | 6   |          |
| J-type: |                                       | op immedia |        |        | ediate: | ate26    |     |          |
|         |                                       |            |        |        | 50      | urce(s)  | des | tination |
| ALU     | $rd \leftarrow (rs) fu$               | nc (rt)    |        |        | 1       | rs, rt   |     | rd       |
| ALUI    | $rt \leftarrow (rs) op$               | imme       | diate  |        |         | rs       |     | rt       |
| LW      | $rt \leftarrow M [(rs) + immediate]$  |            |        |        |         | rs       |     | rt       |
| SW      | $M[(rs) + immediate] \leftarrow (rt)$ |            |        |        | ľ       | rs, rt   |     |          |
| BZ      | cond (rs)                             |            |        |        |         |          |     |          |
|         | <i>true:</i> PC ←                     | (PC)       | + imn  | nediat | e i     | rs       |     |          |
|         | <i>false:</i> $PC \leftarrow$         | (PC)       | + 4    |        |         | rs       |     |          |
| J       | $PC \leftarrow (PC) +$                | - imme     | ediate |        |         |          |     |          |
| JAL     | $r31 \leftarrow (PC)$ ,               | PC ←       | (PC)   | + imn  | nediat  | e        |     | 31       |
| JR      | $PC \leftarrow (rs)$                  |            |        |        |         | rs       |     |          |
| JALR    | $r31 \leftarrow (PC),$                | PC ←       | (rs)   |        |         | rs       |     | 31       |

41

#### Deriving the Stall Signal

```
C_{dest}
ws = Case \text{ opcode}
ALU \Rightarrow rd
ALUi, LW \Rightarrow rt
JAL, JALR \Rightarrow R31
we = Case \text{ opcode}
ALU, ALUi, LW \Rightarrow (ws \neq 0)
JAL, JALR \Rightarrow on
ALU, ALUI, ALR \Rightarrow on
ALU, ALR \Rightarrow on
ALU, ALR \Rightarrow on
```

```
\begin{array}{c} C_{re} \\ re1 = \textit{Case} \text{ opcode} \\ ALU, ALUi, \\ LW, SW, BZ, \\ JR, JALR \Rightarrow on \\ J, JAL \Rightarrow off \\ \\ re2 = \textit{Case} \text{ opcode} \\ ALU, SW \Rightarrow on \\ \Rightarrow off \\ \end{array}
```

```
 \begin{aligned} \text{Stall} &= ((\text{rs}_{\text{D}} = \text{ws}_{\text{E}}).\text{we}_{\text{E}} + \\ &\quad (\text{rs}_{\text{D}} = \text{ws}_{\text{M}}).\text{we}_{\text{M}} + \\ &\quad (\text{rs}_{\text{D}} = \text{ws}_{\text{W}}).\text{we}_{\text{W}}) \cdot \text{re1}_{\text{D}} + \\ &\quad ((\text{rt}_{\text{D}} = \text{ws}_{\text{E}}).\text{we}_{\text{E}} + \\ &\quad (\text{rt}_{\text{D}} = \text{ws}_{\text{M}}).\text{we}_{\text{M}} + \\ &\quad (\text{rt}_{\text{D}} = \text{ws}_{\text{W}}).\text{we}_{\text{W}}) \cdot \text{re2}_{\text{D}} \end{aligned}
```

This is not story

#### Hazards due to Loads & Stores



 $M[(r1)+7] \leftarrow (r2)$  $r4 \leftarrow M[(r3)+5]$ 

Is there any possible data hazard in this instruction sequence?

43

#### Data Hazards Due to Loads and Store

- Example instruction sequence
  - Mem[Regs[r1] + 7] <- Regs[r2]
  - $\text{Regs}[r4] \leftarrow \text{Mem}[\text{Regs}[r3] + 5]$
- What if Regs[r1]+7 == Regs[r3]+5 ?
  - Writing and reading to/from the same address
  - Hazard is avoided because our memory system completes writes in a single cycle
  - More realistic memory system will require more careful handling of data hazards due to loads and stores

#### Overview of Data Hazards

- Data hazards occur when one instruction depends on a data value produced by a preceding instruction still in the pipeline
- Approaches to resolving data hazards
  - Schedule: Programmer explicitly avoids scheduling instructions that would create data hazards
  - Stall: Hardware includes control logic that freezes earlier stages until preceding instruction has finished producing data value
  - Bypass: Hardware datapath allows values to be sent to an earlier stage before preceding instruction has left the pipeline
  - Speculate: Guess that there is not a problem, if incorrect kill speculative instruction and restart

#### Adding Bypassing to the Datapath



When does this bypass help?

$$(I_1)$$
  $r1 \leftarrow r0 + 10$   $r1 \leftarrow Mem[r0 + 10]$   $r1 \leftarrow Mem[r0 + 10]$   $r1 \leftarrow r4 \leftarrow r1 + 17$   $r4 \leftarrow r31 + 17$ 

#### Deriving the Bypass Signal

Each stall or kill introduces a bubble in the pipeline

$$\Rightarrow$$
 CPI > 1

A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input

# The Bypass Signal

Deriving it from the Stall Signal

```
stall = (\frac{((rs_D = ws_E).we_E + (rs_D = ws_M).we_M + (rs_D = ws_W).we_W).re1_D + ((rt_D = ws_E).we_E + (rt_D = ws_M).we_M + (rt_D = ws_W).we_W).re2_D)
```

```
ws = Case opcode

ALU \Rightarrow rd

ALUi, LW \Rightarrow rt

JAL, JALR \Rightarrow R31
```

```
we = Case \text{ opcode}
ALU, ALUi, LW \Rightarrow (ws \neq 0)
JAL, JALR \Rightarrow on
\Rightarrow off
```

$$ASrc = (rs_D = ws_E).we_E.re1_D$$

Is this correct?

No because only ALU and ALUi instructions can benefit from this bypass

Split we<sub>E</sub> into two components: we-bypass, we-stall

# Bypass and Stall Signals

#### Split we<sub>F</sub> into two components: we-bypass, we-stall

```
we-bypass_E = Case \text{ opcode}_E
ALU, ALUi \Rightarrow (ws \neq 0)
\cdots \Rightarrow off
```

```
ASrc = (rs_D = ws_E).we-bypass_E . re1_D
```

```
stall = ((rs_D = ws_E).we-stall_E + (rs_D = ws_M).we_M + (rs_D = ws_W).we_W). re1_D + ((rt_D = ws_E).we_E + (rt_D = ws_M).we_M + (rt_D = ws_W).we_W). re2_D
```

#### Fully Bypassed Datapath



a need for the stall signal?

stall = 
$$(rs_D=ws_E)$$
.  $(opcode_E=LW_E)$ . $(ws_E\neq 0)$ . $re1_D + (rt_D=ws_E)$ .  $(opcode_E=LW_E)$ . $(ws_E\neq 0)$ . $re2_D$ 

#### Overview of Data Hazards

- Data hazards occur when one instruction depends on a data value produced by a preceding instruction still in the pipeline
- Approaches to resolving data hazards
  - Schedule: Programmer explicitly avoids scheduling instructions that would create data hazards
  - Stall: Hardware includes control logic that freezes earlier stages until preceding instruction has finished producing data value
  - Bypass: Hardware datapath allows values to be sent to an earlier stage before preceding instruction has left the pipeline
  - Speculate: Guess that there is not a problem, if incorrecturse kill speculative instruction and restart

#### Agenda

- Microcoded Microarchitectures
- Pipeline Review
  - Pipelining Basics
  - Structural Hazards
  - Data Hazards
  - Control Hazards

#### **Control Hazards**

- What do we need to calculate next PC?
  - For Jumps
    - Opcode, offset and PC
  - For Jump Register
    - Opcode and Register value
  - For Conditional Branches
    - Opcode, PC, Register (for condition), and offset
  - For all other instructions
    - Opcode and PC
      - have to know it's not one of above!

#### Opcode Decoding Bubble

(assuming no branch delay slots for now)

```
time
                         t2 t3 t4 t5 t6 t7 ....
                t0
                    t1
                    nop I_2 nop I_3 nop I_4
            IF
            ID
                         nop I_2 nop I_3 nop I_4
Resource
                             nop I_2 nop I_3 nop I_4
            EX
Usage
            MA
                             I_1 nop I_2 nop I_3 nop I_4
            WB
                                  I_1 nop I_2 nop I_3 nop I_4
            CPI = 2!
                                       nop ⇒ pipeline bubble 54
```

#### Speculate next address is PC+4





A jump instruction kills (not stalls) the following instruction

How?

# Pipelining Jumps



#### Jump Pipeline Diagrams

```
time
                                 t1 t2 t3 t4 t5 t6 t7
                          t0
(I_1) 096: ADD
                                 ID<sub>1</sub> EX<sub>1</sub> MA<sub>1</sub> WB<sub>1</sub>
                         \mathsf{IF}_1
                                 IF<sub>2</sub> ID<sub>2</sub> EX<sub>2</sub> MA<sub>2</sub> WB<sub>2</sub>
(I_2) 100: J 304
                                        IF<sub>3</sub> hop nop nop nop
(I_3) 104: ADD
                                              IF<sub>4</sub> ID<sub>4</sub> EX<sub>4</sub> MA<sub>4</sub> WB<sub>4</sub>
(I_4) 304: ADD
                          time
                                 t1 t2 t3 t4 t5 t6 t7 ....
                          t0
                    ΙF
                                 I_2 I_3 I_4 I_5
                                       I_2 nop I_4 I_5
                    ID
Resource
                    EX
                                             I_2 nop I_4 I_5
Usage
                    MA
                                                     I_2 nop I_4 I_5
                    WB
                                                            I_2 nop I_4 I_5
```

# Pipelining Conditional Branches





Branch condition is not known until the execute stage

what action should be taken in the decode stage?

58

# Pipelining Conditional Branches



If the branch is taken

- $I_1$  096 ADD  $I_2$  100 BEQZ r1 +200  $I_3$  104 ADD 108 ...  $I_4$  304 ADD
- kill the two following instructions
- the instruction at the decode stage is not valid

# Pipelining Conditional Branches



If the branch is taken

- $I_1$  096 ADD  $I_2$  100 BEQZ r1 +200  $I_3$  104 ADD 108 ...  $I_4$  304 ADD
- kill the two following instructions
- the instruction at the decode stage is not valid

#### New Stall Signal

```
stall = ( ((rs_D = ws_E).we_E + (rs_D = ws_M).we_M + (rs_D = ws_W).we_W).re1_D 
+ ((rt_D = ws_E).we_E + (rt_D = ws_M).we_M + (rt_D = ws_W).we_W).re2_D )
.!((opcode_E = BEQZ).z + (opcode_E = BNEZ).!z)
```

Don't stall if the branch is taken. Why?

Instruction at the decode stage is invalid

#### Control Equations for PC and IR Muxes

```
\begin{split} IRSrc_D &= \textit{Case} \; \text{opcode}_E \\ BEQZ.z, \; BNEZ.!z &\Rightarrow \text{nop} \\ \dots &\Rightarrow \\ \textit{Case} \; \text{opcode}_D \\ &\quad J, \; JAL, \; JR, \; JALR \Rightarrow \text{nop} \\ \dots &\Rightarrow IM \end{split}
```

Give priority
to the older
instruction,
i.e., execute-stage
instruction
over decode-stage
instruction

```
\begin{split} IRSrc_{E} &= \textit{Case} \; \text{opcode}_{E} \\ &= \text{BEQZ.z, BNEZ.!z} \quad \Rightarrow \text{nop} \\ &= \cdots \quad \Rightarrow \text{stall.nop + !stall.IR}_{D} \end{split}
```

#### Branch Pipeline Diagrams

(resolved in execute stage)

```
time
                      t0
                                t2 t3 t4 t5 t6 t7 ....
                           t1
(I_1) 096: ADD IF_1 ID_1 EX_1 MA_1 WB_1
(I_2) 100: BEQZ +200 IF_2 ID_2 EX_2 MA_2 WB_2
(I_3) 104: ADD
                                 IF<sub>3</sub> ID<sub>3</sub> hop nop nop
    108:
                                      IF<sub>4</sub> hop nop nop nop
(I_5) 304: ADD
                                            IF<sub>5</sub> ID<sub>5</sub> EX<sub>5</sub> MA<sub>5</sub> WB<sub>5</sub>
                      time
                           t1 t2 t3 t4 t5 t6 t7 ....
                      t0
                           I_2 I_3 I_4 I_5
                IF
                     I_1
                ID
                           I_1 \quad I_2 \quad I_3 \quad \text{nop } I_5
Resource
                                   I_2 nop nop I_5
                EX
Usage
                MA
                                            I_2 nop nop I_5
                WB
                                                 I_2 nop nop I_5
                                                   nop ⇒ pipeline bubble
```

#### Reducing Branch Penalty

(resolve in decode stage)

- One pipeline bubble can be removed if an extra comparator is used in the Decode stage
  - But might elongate cycle time



Pipeline diagram now same as for jumps

#### **Branch Delay Slots**

(expose control hazard to software)

- Change the ISA semantics so that the instruction that follows a jump or branch is always executed
  - gives compiler the flexibility to put in a useful instruction where normally a pipeline bubble would have resulted.

 Other techniques include more advanced branch prediction, which can dramatically reduce the branch penalty... to come later

#### Branch Pipeline Diagrams

(branch delay slot)

# Why an Instruction may not be dispatched every cycle (CPI>1)

- Full bypassing may be too expensive to implement
  - typically all frequently used paths are provided
  - some infrequently used bypass paths may increase cycle time and counteract the benefit of reducing CPI
- Loads have two-cycle latency
  - Instruction after load cannot use load result
  - MIPS-I ISA defined load delay slots, a software-visible pipeline hazard (compiler schedules independent instruction or inserts NOP to avoid hazard). Removed in MIPS-II (pipeline interlocks added in hardware)
    - MIPS: "Microprocessor without Interlocked Pipeline Stages"
- Conditional branches may cause bubbles
  - kill following instruction(s) if no delay slots

Machines with software-visible delay slots may execute significant number of NOP instructions inserted by the compiler. NOPs not counted in useful CPI (alternatively, increase instructions/program)

#### Other Control Hazards

- Exceptions
- Interrupts

More on this later in the course

#### Agenda

- Microcoded Microarchitectures
- Pipeline Review
  - Pipelining Basics
  - Structural Hazards
  - Data Hazards
  - Control Hazards

#### Acknowledgements

- These slides contain material developed and copyright by:
  - Arvind (MIT)
  - Krste Asanovic (MIT/UCB)
  - Joel Emer (Intel/MIT)
  - James Hoe (CMU)
  - John Kubiatowicz (UCB)
  - David Patterson (UCB)
  - Christopher Batten (Cornell)
- MIT material derived from course 6.823
- UCB material derived from course CS252 & CS152
- Cornell material derived from course ECE 4750

Copyright © 2013 David Wentzlaff